In this research, we are going to work with an hourly electricity consumption data in order to fit a model with which the consumption of the next two weeks will be forecasted.
The hourly electricity consumption data is imported from Transparency Platform by EPİAŞ The consumption data from 1st of January, 2016 till the 20th of May, 2021 is used in this analysis.
The data is downladed as a csv file and imported to R with read_csv() function.
After certain manipulations, the data looks like this.
head(df)
## DateTime Consumption
## 1: 2016-01-01 00:00:00 26277.24
## 2: 2016-01-01 01:00:00 24991.82
## 3: 2016-01-01 02:00:00 23532.61
## 4: 2016-01-01 03:00:00 22464.78
## 5: 2016-01-01 04:00:00 22002.91
## 6: 2016-01-01 05:00:00 21957.08
Also, the consumption is 0 at 2016-03-27, 02:00:00. This is obviously an error which occured while entering or importing the data. In order to improve the analysis, this 0 value is replaced by the consumption of 2016-03-26, 02:00:00
First, let’s visualize the data.
There seems to be a seasonal effect: the consumption increases in summers and winters, and it decreases in falls and springs. Notice that the increase in the electricity consumption is higher in summers as compared to the increase in winters. There is also a slightly increasing trend. Several low consumption days are observed: these are religious and national holidays. Lastly, a decrease in 2020 spring is observed. This is due to the Covid-19 pandemic.
The data is not stationary due to several reasons discussed above. We can also check whether the data is stationary by a KPSS test.
summary(ur.kpss(df$Consumption))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 18 lags.
##
## Value of test-statistic is: 12.6733
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
H0 in KPSS test states that the data is stationary. The value of test statistic is higher than all the critical values. The outcome declares that there is strong evidence to reject the idea of the consumption data being stationary.
We will try to decompose the time series in order to make it as stationary as possible. The time series object will be decomposed at different levels ( hourly, daily and monthly )
Three plots will be analyzed: trend cycle component, seasonal component and the remaining random component. Trend cycle component explains the trend by moving average. The seasonal component is obtained when the trend cycle component is removed from the time series object. Finally, the random component is obtained when the time series object is deseasonalized and etrended. We expect this random component to be stationary.
There is no increasing variance in the data, therefore the decomposition type will be additive.
The hourly decomposition plots are below.
plot(hourly_dec)
The hourly seasoanlity effect is below.
With this plot, we can observe that the electricity consumption increases from 6 am until 12 am. Then it stays constant until midnight. Then it decreaes until 6 am.
The random component of the hourly decomposed data is below.
plot(hourly_dec$random)
To analyze whether the time series follow a pattern in every 7 days or not, the time series object will be decomposed at daily level, therefore its frequency will be 24*7, which equals to 168.
The daily decomposition plots are below.
plot(daily_dec)
The daily seasoanlity effect is below.
With this plot, we can observe that the consumption is lower at weekends. In addition, Sundays have the least consumption. The reason behind this is simple: most factories are closed on Sundays.
The random component of the daily decomposed data is below.
plot(daily_dec$random)
To analyze whether the time series follow a pattern in every 12 months or not, the time series object will be decomposed at monthly level, therefore its frequency will be (24* 7)*52, which equals to 8736.
The monthly decomposition plots are below.
plot(monthly_dec)
The monthly seasonality effect is below
By observing the seasonality plot, one can state that there is high consumption during winters and summers. The reason may be high and low temperatures hence the need of air conditioning. Also, summer time consumption is slightly higher than winter time consumption.
The random component of the monthly decomposed data is below.
plot(monthly_dec$random)
As the frequency of the time series increases, trend component becomes smoother. There is less ups and downs as we search for a seasoanlity in a higher frequency.
We will continue our analysis with the time series object which has 7*24 as frequency. This means that both the hour and the day of the observation define the seasonality.
We have already decomposed the time series object with frequency 7*24
The daily decomposition plots are below.
plot(daily_dec)
The hourly seasoanlity effect is below.
As we have already discussed; 1) the consumption is lower in the weekends, especially on Sunday. 2) the consumption increases from 6 am until 12 am. Then it stays constant until midnight. Then it decreaes until 6 am.
plot(daily_dec$random)
The random component of the time series object seems to be random.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 18 lags.
##
## Value of test-statistic is: 0.0042
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
H0 in KPSS test states that the data is stationary. The value of test statistic is significantly lower than all the critical values. The outcome declares that there is no evidence to reject the idea of the consumption data being stationary. Therefore we can state that the random component is stationary. We can now start building models.
It is also essential to analyze the ACF and PACF plots when building a model.
The ACF is sinusoidal, and in PACF there are two significant spikes at the first two lags. However, there is also a significant spike at lag 25 in the PACF. By observing the ACF and PACF plots, we can state that this time series can not be modeled only by AR or only by MA. We will need to build an ARMA model.
Now, several AR models will be built.
ar1 <- arima(df[,Random], order = c(1,0,0))
ar2 <- arima(df[,Random], order = c(2,0,0))
ar3 <- arima(df[,Random], order = c(3,0,0))
ar4 <- arima(df[,Random], order = c(4,0,0))
ar5 <- arima(df[,Random], order = c(5,0,0))
AIC_ar <- c(ar1=AIC(ar1), ar2=AIC(ar2) , ar3=AIC(ar3), ar4=AIC(ar4) , ar5= AIC(ar5))
which.min(AIC_ar)
## ar5
## 5
AR 5 model has the lowest AIC, hence its the best model among them.
Now, several MA models will be built.
ma1 <- arima(df[,Random], order = c(0,0,1))
ma2 <- arima(df[,Random], order = c(0,0,2))
ma3 <- arima(df[,Random], order = c(0,0,3))
ma4 <- arima(df[,Random], order = c(0,0,4))
ma5 <- arima(df[,Random], order = c(0,0,5))
AIC_ma <- c(ma1=AIC(ma1),ma2=AIC(ma2), ma3=AIC(ma3) , ma4=AIC(ma4),ma5=AIC(ma5))
which.min(AIC_ma)
## ma5
## 5
MA 5 model has the lowest AIC, hence its the best model among them.
Remark: As the models’ order increase, AIC value continues to decrease. However,increasing the order is computationally costly. Therefore we selected the best AR and MA models among five models.
AR 5 model yields a lower AIC value than MA 5 model. However, the model can be improved.
Let’s combine the selected AR 5 and MA 5 models and build an ARMA model.
arma1 <-arima(df[,Random], order = c(5,0,5))
AIC(arma1)
## [1] 715002.6
This model yields an AIC value of 715038.9, which is smaller than the AR 5 model. This model is complex and computationally costly. Let’s decrease the complexity by decreasing the number of parameters.
arma2 <- arima(df[,Random], order = c(4,0,4))
AIC(arma2)
## [1] 715073.8
With this new model, both AIC value and the complexity of the model decrease. Therefore this model will be used for forecasting. The corresponding AIC value is 713871.9.
We should observe how the model fits the data.
The below graph represents the actual and fitted data over time.
In order to better visualize it we can zoom in to a random interval.
The model seems to fit the actual data.
We are going to forecast the electricity consumption between 6th of May to 20th of May in 2021.
The last trend value will be used in the forecasts. The effect of seasonality and the last trend value will be added to forecasts of the random component.
The graph below shows the actual values and forecasts over time.
The forecasts do not fit the actual data well. The model can not reflect some behaviours of the data. A SARIMA model could have fit better.
We can use daily MAPE, daily bias and WMAPE to evaluate the model. accu() function provides the overall metrics.
accu
## n mean sd CV bias MAPE WMAPE
## 1 360 32019.35 4951.71 0.1546475 0.008845775 0.1112664 0.1040904
Overall MAPE and WMAPE are reasonably small, which is a good sign.
Let’s analyze daily bias and daily MAPE.
metrics
## Date DailyMAPE DailyBias
## 1: 2021-05-06 0.08311380 0.08311380
## 2: 2021-05-07 0.07533007 0.07533007
## 3: 2021-05-08 0.10028647 0.10028647
## 4: 2021-05-09 0.11791962 0.11791962
## 5: 2021-05-10 0.05645427 0.05185319
## 6: 2021-05-11 0.04432371 0.03074612
## 7: 2021-05-12 0.15063255 -0.12034067
## 8: 2021-05-13 0.31266082 -0.31266082
## 9: 2021-05-14 0.25477144 -0.25477144
## 10: 2021-05-15 0.14366922 -0.14366922
## 11: 2021-05-16 0.04631920 0.01340022
## 12: 2021-05-17 0.05723857 0.05723857
## 13: 2021-05-18 0.08374880 0.08374880
## 14: 2021-05-19 0.05133373 0.05133373
## 15: 2021-05-20 0.09119342 0.09119342
At some days, such as May 5th, daily MAPE and daily bias is very high. However, overall bias and overall MAPE are reasonably small.
To conclude, we have built an ARMA(4,0,4) model to forecast the electricity consumption between May 6th and May 20th 2021. We evaluated the forecasts and the model.